[superlog] Downgrade ClickHouse timeout log from ERROR to WARN in tracking health checks #468
Greptile Summary
This PR downgrades ClickHouse timeout log events from ERROR to WARN in the tracking health checks.
Confidence Score: 4/5
Safe to merge: the change only affects log levels in two catch blocks, and the observable behavior of the endpoint is unchanged.
The fix correctly demotes noisy timeout alerts to WARN while preserving ERROR for unexpected failures. The only fragility is that the timeout sentinel string is duplicated four times as a bare literal across throw sites and catch comparisons, with no shared constant to keep them in sync. A future rename in one place would not be caught by the type system and would silently break the discrimination.
File needing attention: packages/rpc/src/routers/websites.ts, specifically the four inline occurrences of "ClickHouse query timeout" that would benefit from a shared constant.
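One way to address the duplicated sentinel is sketched below. This is only an illustration of the shared-constant idea from the review: the names CLICKHOUSE_TIMEOUT_MESSAGE, createTimeoutError, and isClickHouseTimeout are hypothetical and do not appear in the PR.

```typescript
// Hypothetical shared constant: the single source of truth for the sentinel text.
const CLICKHOUSE_TIMEOUT_MESSAGE = "ClickHouse query timeout";

// Hypothetical helper for the throw sites, so no site spells the literal itself.
function createTimeoutError(): Error {
  return new Error(CLICKHOUSE_TIMEOUT_MESSAGE);
}

// Hypothetical helper for the catch sites: true only for the known timeout.
function isClickHouseTimeout(error: unknown): boolean {
  return error instanceof Error && error.message === CLICKHOUSE_TIMEOUT_MESSAGE;
}
```

With both the throw sites and the catch comparisons going through these helpers, renaming the message becomes a one-line change and the two sides cannot drift apart.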
Sequence Diagram
sequenceDiagram
participant Client
participant isTrackingSetup
participant getTrackingEventsStatus
participant getRecentBlockedTrackingIssue
participant ClickHouse
participant Logger
Client->>isTrackingSetup: GET /isTrackingSetup
isTrackingSetup->>getTrackingEventsStatus: call
isTrackingSetup->>getRecentBlockedTrackingIssue: call
getTrackingEventsStatus->>ClickHouse: chQuery (analytics.events)
getRecentBlockedTrackingIssue->>ClickHouse: chQuery (analytics.blocked_traffic)
alt Query completes within 10s
ClickHouse-->>getTrackingEventsStatus: result rows
ClickHouse-->>getRecentBlockedTrackingIssue: result rows
getTrackingEventsStatus-->>isTrackingSetup: EventsCheckResult
getRecentBlockedTrackingIssue-->>isTrackingSetup: BlockedTrackingIssueRow or null
else Timeout fires (10s)
Note over getTrackingEventsStatus,getRecentBlockedTrackingIssue: Promise.race rejects with Error("ClickHouse query timeout")
getTrackingEventsStatus->>Logger: logger.warn (was ERROR)
getRecentBlockedTrackingIssue->>Logger: logger.warn (was ERROR)
getTrackingEventsStatus-->>isTrackingSetup: hasEvents false, recentEvents 0
getRecentBlockedTrackingIssue-->>isTrackingSetup: null
Note over ClickHouse: Query continues running in background
else Unexpected ClickHouse error
getTrackingEventsStatus->>Logger: logger.error (unchanged)
getRecentBlockedTrackingIssue->>Logger: logger.error (unchanged)
end
isTrackingSetup-->>Client: 200 tracking_issue null
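The timeout and unexpected-error branches in the diagram can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from websites.ts: the Logger interface, withTimeout, logHealthCheckFailure, and runHealthCheck are hypothetical names, and the real logger's call signature is not shown in the PR.

```typescript
// Assumed logger shape, for illustration only.
interface Logger {
  warn(context: object, message: string): void;
  error(context: object, message: string): void;
}

const TIMEOUT_MESSAGE = "ClickHouse query timeout";

// Rejects with the timeout sentinel if `query` has not settled within `ms`.
// The underlying query is not cancelled; it keeps running in the background.
function withTimeout<T>(query: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(TIMEOUT_MESSAGE)), ms);
  });
  // Clear the timer once either side settles so it does not linger.
  return Promise.race([query, timeout]).finally(() => clearTimeout(timer));
}

// Known timeouts go to WARN; anything else stays at ERROR.
function logHealthCheckFailure(error: unknown, logger: Logger): void {
  const isTimeout = error instanceof Error && error.message === TIMEOUT_MESSAGE;
  if (isTimeout) {
    logger.warn({ error }, "ClickHouse health check timed out");
  } else {
    logger.error({ error }, "ClickHouse health check failed");
  }
}

// Runs a health-check query and degrades to `fallback` on any failure,
// so the endpoint can still answer 200.
async function runHealthCheck<T>(
  query: () => Promise<T>,
  fallback: T,
  logger: Logger,
  timeoutMs = 10000,
): Promise<T> {
  try {
    return await withTimeout(query(), timeoutMs);
  } catch (error) {
    logHealthCheckFailure(error, logger);
    return fallback;
  }
}
```

Keeping the level decision in one small function means both health checks share the same discrimination logic instead of repeating the comparison in each catch block.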
Summary
The isTrackingSetup dashboard endpoint occasionally logs ERROR when its ClickHouse query for blocked-traffic data exceeds the 10-second timeout. The endpoint itself recovers gracefully (it returns 200 with tracking_issue: null), so the ERROR is a false alarm that creates noisy incidents.
Both getRecentBlockedTrackingIssue and getTrackingEventsStatus share the same Promise.race timeout guard and the same mis-levelled catch block. The fix differentiates between the known timeout (logged as WARN) and genuinely unexpected errors (kept at ERROR), applied symmetrically to both functions.
An alternative approach would be to pass an AbortSignal to the ClickHouse client so the underlying query is cancelled when the race times out, rather than letting it continue running in the background. That would reduce ClickHouse load during busy periods but requires ClickHouse client support for cancellation.
Incident on Superlog
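The AbortSignal alternative mentioned in the summary might look something like the sketch below. It uses only the standard AbortController and setTimeout APIs. The CancellableQuery type and queryWithCancellation function are hypothetical, and the sketch assumes the query function actually honours the signal by rejecting once it is aborted; how the repository's chQuery wrapper would accept a signal is not shown in the PR.

```typescript
// Hypothetical query shape: the caller forwards `signal` to the ClickHouse client.
type CancellableQuery<T> = (signal: AbortSignal) => Promise<T>;

// Aborts the query after `timeoutMs` instead of merely abandoning it.
// Assumption: `runQuery` rejects when the signal is aborted.
async function queryWithCancellation<T>(
  runQuery: CancellableQuery<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => {
    // Reuse the same sentinel text so existing catch blocks still recognise it.
    controller.abort(new Error("ClickHouse query timeout"));
  }, timeoutMs);
  try {
    return await runQuery(controller.signal);
  } finally {
    // Always clear the timer, whether the query resolved or rejected.
    clearTimeout(timer);
  }
}
```

If the client ignores the signal, this version never times out on its own, so it may still need to be combined with the existing Promise.race guard.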
Summary by cubic
Downgraded ClickHouse timeout logs to WARN in tracking health checks to stop false alarms. Unexpected errors remain ERROR; endpoint behavior is unchanged.
Written for commit 8f429e8. Summary will update on new commits.